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Abstract 

Introduction. The banking sector assigns high priority to data storage, as it is a critical aspect of business operations. 
The volume of data in this area is steadily growing. With the increasing volume of data that needs to be stored, 
processed and analyzed, it is critically important to select a suitable data storage solution and develop the required 
architecture. The presented research is aimed at filling the gap in the existing knowledge of the data base management 
system (DBMS) suitable for the banking sector, as well as to suggest ways for a fault-tolerant data storage cluster. The 
purpose of the work is to analyze the key DBMS for analytical queries, determine the priorities of the DBMS for the 
banking sector, and develop a fault-tolerant data storage cluster. To meet the performance and scalability requirements, 
a data storage solution with a fault-tolerant architecture that meets the requirements of the banking sector has been 
proposed. 

Materials and Methods. Domain analysis allowed us to create a set of characteristics that a DBMS for analytical 
queries (OnLine Analytical processing —- OLAP) should correspond to, compare some popular DBMS OLAP, and 
offer a fault-tolerant cluster configuration written in xml, supported by the ClickHouse DBMS. Automation was done 
using Ansible Playbook. It was integrated with the Gitlab version control system and Jinja templates. Thus, rapid 
deployment of the configuration on all nodes of the cluster was achieved. 

Results. For OLAP databases, criteria were developed and several popular systems were compared. As a result, a 
reliable cluster configuration that met the requirements of analytical queries has been proposed for the banking industry. 
To increase the reliability and scalability of the DBMS, the deployment process was automated. Detailed diagrams of 
the cluster configuration were also provided. 

Discussion and Conclusions. The compiled criteria for the DBMS OLAP allowed us to determine the need for this 
solution in the organization. Comparison of popular DBMS can be used by organizations to minimize costs when 
selecting a solution. The proposed configuration of the data warehouse cluster for analytical queries in the banking 
sector will improve the reliability of the DBMS and meet the requirements for subsequent scalability. Automation of 
cluster deployment by the mechanism of templating configuration files in Ansible Playbook provides configuring a 


ready-made cluster on new servers in minutes. 
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AnHOoTanHa 

Beedenue. bankosckul ceKTop mpusaeT OombuIoe 3HAaYeHHe XpaHeHHIO TaHHBbIX, MOCKOJIbKY 9TO KPHTHYCCKH BaxKHBIM 
acliekT Ou3Hec-onepalHi. OObeM aHHbIX B WaHHOM ciepe HeyKIOHHO pacTeT. C yBesH4YeHHeM OOBeEMOB JaHHBIX, 
KOTOpble HeEOOXOAMMO XpaHUTb, OOpabaTHIBaTbh M aHaJIM3HpOBaTb, KpaliHe BaxKHO BbIOpaTbh NOAXOAAWee pelieHHe DA 
XpaHeHHa JJaHHbIX HM pa3spadoTaTb HeoOxogzuMyro apxuTeKtypy. IIpeyctaBieHHoe HMcceqoBaHve HalpaBeHo Ha TO, 
YTOOBI 3aN0JHHTb upoben B cylecTByIomjMx 3HaHHAx CYB], nogzxopsawHx Id OaHKOBCKOrO ceKTopa, a TakoKe 
TIpeAIOKUTh CHOCOObI AI OTKA30YCTOMYMBOTO KJlacTepa xpaHeHHaA aHHBIX. Llerb paOoTbl — aHasIM3 KJIIOUCBBIX 
CYB] wa anamuTwueckux 3anpocos, onmpesemenve mpHoputetos CYB] ana OaHKoBcKoro ceKTopa MH pa3pa0oTKa 
OTKa30ycTOM4MBOrO KylacTepa xpaHeHHa anHbIx. Jia BbINOMHeHHA TpeOoBaHHH K MpOM3BOAHTebHOCTH U 
MaclUTaOMpyeMOCTH TipeANOXKeHO pellieHHe JIT XpaHeHHA TaHHbIX C OTKa30yCTOMYMBOM apXHTeKTypoH, OTBeYAaTOLee 
TpeOoBaHHAM OaHKOBCKOrO CeKTopa. 

Mamepuaavi u memoooi. Ananu3 ipeqMeTHO OOacTH NO3BONMI CO31aTb Hadop XapaKTepHCTHK, KOTOPbIM JODKHAa 
cooTBeTcTBoBaTb CYbJ] aa avamuTM4eckux 3armpocos (OLAP), BsinoHHTbh cpaBHeHHe HeKOTOPBIX MOMYyApHbIX 
OLAP CYB], u mpeqnoxuTs oTKa30ycTOMuMBylO KacTepHy!O KOH®urypalMio, HallucaHHyIo Ha s3bIKe xml, 
nosepxuBaemyro CYB] ClickHouse. ABToMaTu3alHA BbIMOHeHa c HOoMoubIO Ansible Playbooks. On mHTerpupoBaH 
c CHcTeMOM yupaBleHua Bepcnamu Gitlab u miaOnoHamn Jinja. Takum oOpa30M ocTuraeTca ObICTpoe pa3sBepTEIBaHHe 
KOH(urypalluu Ha BCexX HOax KlacTepa. 

Pezyismamot uccnedosanua. na 0a3 QanHerx OLAP Osim pa3spaboTaHbI KpHTepHu, MpoBeyeH cpaBHUTeJIbHbIi 
aHaJIM3 HECKOJIBKMX MOMMyIAPHbIX cHcTeM. B pesynbTaTe Oba WpewoxwKeHa HayexKHad KlacTepHad KOHurypalHA B 
OaHKOBCKOM MHAYCTpHu, KOTOpad yOBeTBOpseT TpeOoBaHHAM aHaIMTMYeCKHX 3anpocos. Jina yBerM4eHHa 
HajjexKHOCTH HM MacuITaOupyemoctu CYB] mporecc pa3BepTpiBaHua OBI aBTOMaTH3NpoBaH. Takxke IpHBeeHbI 
WeTasIbHble CXeMbI KOHUTypalHn KslacTepa. 

O6écyorcdenue u 3aKnio“uenua. CocTaBileHHble KpuTepun ya OLAP CYB/J] nosBonsaroT ompeseuTh HeOOXO,HMOCTb 
JaHHOrO pelieHHA B OpraHH3auun. CpaBHenve nonyapHErx CYB] MoxerT OpiTb UciOb30BaHO OpraHH3alHAMH JIA 
MHHMMV3aHH 3aTpaT upH BbIOope pewienua. IIpeqnaraemaa KOHuTypalua KacTepa xpaHvIMUla WaHHbIX JA 
aHaJIMTHYeCKHX 3alIpOCoB B OaHKOBCKOl ciepe MO3BOJIMT MOBbICHTb HayexHocTs CYB] u yqoBleTBopuTs TpedoBaHHa 
K TlocleyyrouleH MaciuTaOupyemocTu. ABTOMaTH3al[HA pa3BepTbIBAHHA KjlacTepa MyTeM MexaHH3Ma WadOHH3alHu 
KOH@UTypalHOHHBIx dalioBn B Ansible Playbooks no3BomaeT HaCTpOHTb TOTOBbI KIacTep Ha HOBbIX CepBepax 3a 


MHHYTBI. 
K.unovespre copa: CYB, OLAP, xpanumnure qanHbrx, ClickHouse, orka30ycTOH4MBbIit KacTep. 
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Introduction. Data storage in the banking sector is one of the key business factors. To ensure the security of 
customer information and transactions, it is required to take measures of protection, distribution and creation of 
backups. For operational analysis, employees should be able to make operational analytical requests to the data 
warehouse, while not interfering with the work of other processes within the organization and without causing a heavy 
load on the storage itself. Databases and Data Warehouse are information systems in which data is stored, but they are 
also used to solve various tasks. The article describes what such systems do, what the main differences between them 
are, and why their effective use is essential for business development. 

Many organizations make mistakes in designing the architecture of databases and data warehouses, losing sight of 
aspects of information security, scalability and fault tolerance. The urgency of this problem is due to the intensive 
development of systems in banks, the expansion of their fields of application and the increase in the amount of data in 
need of constant analysis. For operational analysis of a large amount of data, a storage is needed that must meet all 
reliability and security requirements. 

Effective decision-making processes in business depend on high-quality information. In today's competitive 
business environment, flexible access to a data warehouse is required, organized in such a way as to increase business 
productivity, provide fast, accurate and up-to-date data understanding. The data warehouse architecture is designed to 
meet such requirements and is the basis of these processes [1—5]. 

The objective of the work is to determine the priority DBMS for performing analytical queries in the banking sector 
and design a fault-tolerant data warehouse cluster. This solution will significantly increase the speed of execution of 
analytical queries, solve problems with scalability and reliability of the data warehouse. 

Materials and Methods. The database stores real-time information about one specific part of the business. Its main 
task is to process daily transactions. Databases use Online Transaction Processing (OLTP) to quickly delete, insert, 
replace and update a large number of short online transactions. 

Data warehouse is a system that collects data from lots of different sources within an organization for reporting and 
analysis, using operational analytical processing (OLAP) to quickly analyze large amounts of data. This system focuses 
on reading, rather than changing historical data from lots of different sources, therefore, compliance with ACID 
(Atomic, Consistent, Isolated and Durable) requirements is less strict. Data warehouses perform complex functions of 
aggregation, analysis and comparison of data to support management decision-making in companies. 

A warehouse in the banking sector may contain: 

— user account information (personal data, addresses, phone numbers); 

— information about banking products and services (loans, deposits, plastic cards, mobile banking, etc.); 

— data on transactions (including card transactions) in minimal detail for the last three years; 

— information about accounts, balances on them, etc. 

To meet the needs for OLAP, there are separate types of database management systems (DBMS) [3-6]. Each of the 
systems has its own characteristics in the construction of architecture. 

To perform an effective analysis of compliance with these requirements, warehouses must: 

— have a high capacity capable of accommodating huge amounts of data (billions or trillions of rows); 

— be organized as wide tables with multiple columns; 


— perform queries with a small number of columns; 
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— have a high query execution speed (in milliseconds or seconds); 

— provide for most of the read-only requests; 

— support fast bulk data loading when updating (more than 1,000 rows at a time) and adding, but without changing 
them; 

— have high throughput to process a single request (up to billions of rows); 

— have high reliability; 

— ensure data security and consistency. 

For the OLAP scenario of work in the banking sector, it is preferable to use column-based analytical databases, 
since they can store a lot of columns in a table, which will not affect the speed of reading data. Column-based DBMS 
provide strong compression of data in columns, since data in one column of the table is usually of the same type, which 
cannot be said about a row. They also enable to get a tenfold increase in query execution speed on lower-power 
equipment. At the same time, thanks to compression, the data will occupy 5-10 times less space on the disk than in the 
case of traditional DBMS [7-11]. 

During the requirements analysis, the following column DBMS were selected: ClickHouse, Vertica, Amazon 
Redshift. 

ClickHouse is the preferred solution due to the following advantages: open source; it is possible to define some or 
all structures that will be stored only in memory; high speed; good data compression; http and command line interface; 
cluster can be scaled horizontally; high availability; ease of installation and configuration. Installation is carried out on 
the organization's servers in an isolated segment, which meets the security requirements for sensitive data in the 
banking sector. The DBMS is also included in the register of domestic software; therefore, it provides implementing 
this software product in state-owned companies. 

Amazon Redshift solution is provided only as a cloud service. For organizations from the banking sector that cannot 
place their data in the clouds for a number of security-related reasons, this product loses its appeal. 

Vertica is an alternative version of ClickHouse with a paid license for large clusters and the installability on the 
company's local servers. 

The implementation of the distributed data warehouse architecture is presented below. To increase fault tolerance 
and performance, the implementation of a distributed ClickHouse failover cluster with three shards and two replicas is 
proposed. 

Sharding (horizontal scaling) makes it possible to write and store parts of data in a distributed cluster, process and 
read them in parallel on all nodes of the cluster, increasing data throughput. 

Replication is copying data to multiple servers; thus, each bit of data can be found on multiple nodes. 

Scalability is determined by sharding or segmentation of data. The reliability of the data warehouse is determined by 
data replication [12-16]. 

Sharding and replication are completely independent, different processes are responsible for them. It is required to 
localize small data sets on one shard and ensure a fairly even distribution across different shards in the cluster. To do 
this, it is recommended to take the hash function value from a field in the table as a sharding key. 

Sharding and replication are completely independent, different processes are responsible for them. It is required to 
localize small data sets on one shard and ensure a fairly even distribution across different shards in the cluster. To do 
this, it is recommended to take the hash function value from a field in the table as a sharding key. 

Depending on the number of available resources and servers, it is proposed to implement this configuration on 3 or 6 


nodes. For a production environment, it is recommended to use a cluster of 6 nodes. It should be noted that replication 
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does not depend on sharding mechanisms and works at the level of individual tables, and also, since the replication 
coefficient is 2, each shard is represented in 2 nodes [17—19]. Configuration options are described below. 

The logical topology diagram is as follows: 

3(Shard) x 2(Replicas) = Clickhouse Cluster of 6 nodes. 
The probability of trouble-free operation of a system with 2 replicas and 3 shards on 6 nodes is equal to: 
P, =[1-G-p)’]’. 

The probability of trouble-free operation is an objective possibility that the system will work for time ¢ without 
restorations [7, 13]. 

Thus, a table containing 30 million rows will be distributed evenly across 3 nodes of the cluster. The remaining 
3 nodes will store replicas of the data. When one of the cluster nodes is disabled, data will be taken from another 


available node that contains its replica, thereby achieving reliability [20]. A cluster of 6 nodes is shown in Figure 1. 


node O01 node 02 node 03 


shard 01 shard 02 shard 02 
replica 01 replica 01 replica 02 


node 04 node 05 node 06 


shard 03 shard 03 shard 01 
replica 01 replica 02 replica 02 


Fig. 1. Fault-tolerant cluster of 6 nodes (the authors’ figure) 
To replicate data and execute distributed DDL queries, we need to use +1 node with ZooKeeper installed. You can 
also use ClickHouse Keeper, compatible with ZooKeeper, which does not require installation on a separate server. 
An example of a fragment of the configuration file is shown in Figure 2, from which it can be seen that the shard has 


replication configured for the Ist and 6th nodes. 


<yandex> 
<remote_servers> 
<cluster_1> 


<shard> 
<weight>1</weight> 
<internal_replication>true</internal_replication> 
<replica> 
<host>{{ node1 }}</host> 
<port>9000</port> 
</replica> 
<replica> 
<host>{{ node6 }}</host> 
<port>9000</port> 
</replica> 
</shard> 


Fig. 2. Fragment of the configuration file for 6 nodes (the authors’ figure) 
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An option of the cluster configuration of 3 nodes with cyclic replication is shown in Figure 3. 


node O01 node 02 node 03 


shard 01 ina shard 03 
replica 01 Rsaceeha replica 01 
replica 01 


shard 03 shard 01 shard 02 
replica 02 replica 02 replica 02 


Fig. 3. Fault-tolerant cluster of 3 nodes (the authors’ figure) 


This implementation requires two different segments located on each node. The main problem arises due to the fact 
that each shard has the same table name, ClickHouse cannot distinguish one shard/replica from another when they are 
located on the same server. 

To solve this problem, it is needed: 

— to place each shard in a separate database (schema); 

— to set default_database for each shard; 

— to set internal_replication parameter of each shard to true; 

— to use an empty database parameter in a distributed table DDL script. 

For this topology in an industrial environment, 6 server nodes are required, where each server stores data from only 
one segment, a security trapdoor for a separate database is not required. To save resources in the development or testing 
area, a configuration with 3 nodes can be used. 

Automation is performed using Ansible Playbooks and integrated with Gitlab version control system. Thus, rapid 
deployment of the configuration on all nodes of the cluster is provided. When changing the configuration, it can be 
applied to all nodes with a single command or deploy a new DBMS cluster in a few minutes [21]. 

Research Results. The fault-tolerant cluster of the analytical DBMS provides redundancy for important system 
components, which allows for continuous operation even in case of errors in individual cluster nodes. This is done 
through load balancing, data replication between cluster nodes, and high reliability of the components used in the 
cluster. The result is an increase in the availability and reliability of the analytical DBMS, which is business-critical 
when analytical queries play a key role. The fault-tolerant cluster configuration of the data warehouse for analytical 
queries in the banking sector, taking into account the automation of the deployment process, enables to increase the 
reliability of the analytical data warehouse and meet the requirements for scalability. The developed task of automating 
cluster deployment using the mechanism of templating configuration files in Ansible Playbooks provides for the 
configuration of a ready-made cluster on new servers in a few minutes. The tasks of the template include operations to 
install the required packages, create the needed configuration and launch the cluster. 

An example of configuration files for automatic deployment of a DBMS cluster is shown in Figure 4. The j2 
extension says that they are created using the Jinja template engine. Purpose-built placeholders in the template provide 
writing code similar to Python syntax. Parameters are passed to the template for automatic insertion into the final 
document, thereby achieving automatic assembly into development, testing and industrial operation zones, which does 


not require manual modification of configuration files. 
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’ templates 
clickhouse_config.xml.j2 
clickhouse_keeper.xml.j2 
clickhouse_Idap_auth.xml.j2 
clickhouse_ldap_user_directory.xml.j2 
clickhouse_macro_n1.xml.j2 


clickhouse_macro_n2.xml.j2 


clickhouse_macro_n3.xml.j2 


clickhouse_macro_n4.xml.j2 
clickhouse_macro_n5.xml.j2 
clickhouse_macro_n6.xml.j2 
clickhouse_users.xml.j2 


cluster.xml.j2 


Fig. 4. Configuration files 


Description of configuration files: 

clickhouse_config.xml.j2 — general cluster configuration; 

clickhouse_keeper.xml.j2 —- zookeeper configuration, which is responsible for node synchronization and 
replication; 

clickhouse_Idap_auth.xml.j2 — LDAP connection configuration for data security; 

clickhouse_ldap_user_directory.xml.j2 — role-based configuration by access groups to ensure data security; 

clickhouse_macro_n1(6).xml.j2 — macro files (each node has its own); 

clickhouse_users.xml.j2 — configuration file for creating local users needed for administration; 

cluster.xml.j2 — cluster configuration file. 

To test the reliability of this configuration, an experiment was conducted during which data was loaded into a 
DBMS cluster with a replication factor equal to 2. The dwh schemas and cluster_test_data tables were created on each 
of the nodes of the DBMS cluster, and a distributed table was created on the dwh cluster.cluster_test_data_distributed. 
The rows of the dwh.test_data_distributed table distributed across the cluster were 27,547,855. The rows of the 
dwh.cluster_test_data table with each of the cluster nodes are listed below: 

9,186,544 rows — Ist node; 

9,182,959 rows — 2nd node; 

9,182,959 rows — 3rd node; 

9,178,352 rows — 4th node; 

9,178,352 rows — 5th node; 

9,186,544 rows — 6th node. 

Conspicuously, the table was distributed over the entire cluster. According to the configuration shown in Figure 1, 
the replication factor was 2, which means that each data block would be presented on 2 nodes. This can be seen from 
the number of rows on the nodes: the sixth node stored a copy of the first, the third — a copy of the second, the fifth — 
a copy of the fourth. 

The fault tolerance of this configuration can be checked by alternately disabling nodes in the cluster. To do this, you 
can turn off the node or stop services on one of the nodes with the systemctl stop clickhouse-server command. During 
the experiment, DBMS services were stopped on cluster nodes. 

With simultaneous disconnection of the 3rd, 4th, 6th or Ist, 2nd, 5th nodes that contained replicas, users continued 
to receive data from the dwh.cluster_test_data_distributed table, and the number of rows was equal to 27,547,855. 
When one of the nodes was disabled, the data continued to be displayed, and the number of rows was equal to 
27,547,855. When the nodes containing the replica and the original data were disconnected at the same time, data loss 
occurred. This configuration can be scaled to 12 nodes, then the replication coefficient will be 3, and the sharding 
coefficient will be 6. 

Discussion and Conclusions. The proposed solution can increase the speed of execution of analytical queries, solve 
problems with the scalability and reliability of data storage in banking organizations. The authors have automated 


cluster deployment by using templates in Ansible Playbooks, which provides setting up a ready-made cluster on new 
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servers in minutes. This configuration can be scaled by increasing the number of nodes and adding them to the 
configuration files. 

A set of characteristics that DBMS OLAP should correspond to was indicated, DBMS comparison was performed, a 
fault-tolerant cluster configuration of a data warehouse for analytical queries in the banking sector was proposed, 
automation of the configuration deployment process was performed. A similar solution is applicable for deployment on 
FreeBSD, Linux, macOS. The cluster configuration diagrams are given. This configuration can solve the problem of 


reliability and scalability, which is often found in organizations. 
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